---
title: "Comparison of Gene Expression Among Three Different Tissue Types"
output: 
  flexdashboard::flex_dashboard:
             social: menu
             source: embed
             vertical_layout: fill
             orientation: columns
             theme: flatly
---

```{r,message=F, warning=F, echo=TRUE, include = FALSE}
# check_packages() installs any packages that are missing, then loads all of them
check_packages <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}
# Usage example
packages<-c("tidyr","dplyr","ggplot2","readr","magrittr","reactable","flexdashboard")
check_packages(packages)
```

Introduction
=======================================================================
Column {data-width=400}
-----------------------------------------------------------------------

***Introduction***

Next-generation RNA sequencing (RNA-Seq) is one of the most widely used techniques in molecular biology. It makes it possible to compare gene expression patterns between cells or tissues of any biological system, and across different growth conditions. The human body responds to environmental stimuli, such as drug effects, tissue grafting, or disease, with characteristic patterns of gene expression, so RNA-Seq can also be used to determine how cells behave under a given condition (stressed, relaxed, etc.). The purpose of this project is to show how gene expression patterns differ among three tissue types: liver, adipose, and bone marrow. I specifically look at the next-generation RNA sequencing (NGS) data generated by Mayo Clinic, Rochester, Minnesota, for a tissue grafting project involving liver, adipose, and bone marrow transplants.

Comparing tissues for differential gene expression is a common method in molecular biology. Each RNA-Seq experiment produces a massive amount of expression data covering thousands of genes, many of which do not differ significantly across conditions. Large-scale analysis is therefore needed to identify the candidate genes that do differ significantly across conditions or tissue types and may be biologically important.

In this project, I analyze a dataset from the National Center for Biotechnology Information Gene Expression Omnibus (NCBI-GEO) database to show that gene expression may vary across the grafted tissue types (liver, bone marrow, adipose). My data analysis approach includes data normalization, calculation of mean expression levels, and t-tests of significance. At the end of the study, I will be able to tell which genes vary significantly across tissue types and how many genes change expression significantly between them.


```{r, out.width= "800px", echo=FALSE, fig.align='center'}
knitr::include_graphics("gene_expression_level_comparison.jpg")
```

**Required Packages**

I used the packages described below for the data analysis. A custom function checks whether all required packages are installed. If any are missing, the function installs them and makes them available in the environment. Please check the source code for the function.

```{r, message=F, warning=F, echo=TRUE}
library(tidyr)           # cleans and reshapes the dataset
library(dplyr)           # transforms data and builds the needed data frames
library(ggplot2)         # creates the graphs
library(readr)           # imports datasets into R
library(magrittr)        # provides the pipe for passing data frames to functions
library(flexdashboard)   # builds the interactive dashboard
library(reactable)       # creates interactive tables
```

Data Processing
======================================================================
Column{.tabset}
----------------------------------------------------------------------

***Importance of the Dataset***

Mesenchymal stromal cells (MSCs) can be isolated from different parts of the human body, such as liver, bone marrow, and adipose tissue. Because they occur throughout the body, their location influences their structural components, and they therefore show different patterns of gene expression. For this reason, gene expression in MSCs isolated from liver, bone marrow, and adipose tissue is worth examining. Moreover, these tissue types may respond differently to environmental conditions such as drugs, treatment, and surgery. The study was performed in the context of MSCs used for tissue transplant treatment after liver, adipose tissue, and bone marrow transplant. I found the study dataset (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145728) interesting for several reasons. First, it covers three different sample types. Second, each sample type has enough replicates (6 liver, 3 adipose, and 3 bone marrow) to support reliable statistical analysis of transcriptome (RNA-Seq) data. This will help me identify differentially expressed genes and compare how the three tissue types respond to tissue grafting after transplant surgery.


***Data Source and Importing the Data***

The dataset was downloaded from the National Center for Biotechnology Information Gene Expression Omnibus (NCBI-GEO) database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145728). NCBI-GEO is a public repository of next-generation RNA sequencing and microarray study data. Data are available in Excel format, with or without preprocessing. The dataset I am using is very recent and had only just appeared in the database. I downloaded the data and added it to the application folder. The dataset contains observations for 14471 genes from all 23 human chromosomes. For each of these 14471 genes, the dataset has the following variables:


* Chromosome Number (Chr)

* Ensembl Gene Identification (GeneID)

* Gene Name (GeneName)

* Start Codon

* Stop Codon

* CodingLength

* Gene Expression of 3 Liver Samples (Liver.1, Liver.2, Liver.3)

* Gene Expression of 3 Adipose Samples (Adipose.1, Adipose.2, Adipose.3)

* Gene Expression of 3 Bone Marrow Samples (Marrow.1, Marrow.2, Marrow.3)

For my project, the gene expression columns are the most important. All further statistical analysis was performed on these variables.

```{r,message=F, warning=F, echo=TRUE}
# Reading the data in R                                       
df <- read.csv("DE_Liver_BM_AD_MSC_3.csv", header = TRUE)

```
***Data Processing***

*Missing Value*

RNA-Seq is a highly sensitive technique, so the dataset is not expected to contain missing values. I checked the dataset for NA/na entries and did not find any.

*Log Transformation of the Gene Expression Column*

RNA-Seq gene expression values span a wide range, so the data need to be transformed. I applied a log2 transformation to the gene expression columns (Liver.1,2,3; Adipose.1,2,3; Marrow.1,2,3).

```{r,message=F, warning=F, echo=TRUE}
# log2-transform the nine gene expression columns (columns 7 to 15)
df[, 7:15] <- log2(df[, 7:15])
```
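
The missing-value check and the log2 step described above can be sketched on a small toy table. This is only an illustration: the values and the `toy` data frame are hypothetical, and the pseudocount of 1 is an assumption, not something the analysis above applies. It is shown because a plain log2 maps any zero expression value to `-Inf`.

```r
# Toy expression table (hypothetical values, not taken from GSE145728)
toy <- data.frame(
  GeneName = c("geneA", "geneB", "geneC"),
  Liver.1  = c(120, 0, 35),
  Liver.2  = c(98, 4, NA)
)
expr_cols <- c("Liver.1", "Liver.2")

# Count missing values in each expression column
na_per_column <- colSums(is.na(toy[, expr_cols]))

# A plain log2 turns a zero expression value into -Inf
plain_log <- log2(toy[, expr_cols])

# Adding a pseudocount of 1 (an assumption here) keeps every value finite
shifted_log <- log2(toy[, expr_cols] + 1)
```

If the real data contain zeros, checking `is.infinite()` on the transformed columns before plotting or averaging is a cheap safeguard.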

Column {.tabset}
-------------------------------------------------------------------
***Data***

The data presented below have 14471 rows and 15 columns. I used the 'reactable' package, which provides an interactive view of the data frame.

```{r,message=F, warning=F, echo = FALSE}
# Display the full data frame as an interactive table
reactable(df)
```

Exploratory Analysis{.storyboard}
====================================================================

***Exploratory analysis is performed to:***

1. ***Check whether gene expression is normally distributed after the log2 transformation***
2. ***Plot correlations between tissues to check whether genes are differentially expressed***


```{r,message=F, warning=F, echo=FALSE}
# Calculate the average gene expression per tissue across replicates
df$Liver_Avg <- rowMeans(df[,c(7:9)])
df$Adipose_Avg <- rowMeans(df[,c(10:12)])
df$Marrow_Avg <- rowMeans(df[,c(13:15)])
```

### Distribution of The Gene Expression of Liver, Adipose, and Bone Marrow Tissue.{data-commentary-width=500}
```{r,message=F, warning=F, echo=FALSE}
library(ggplot2)
# Each plot: density-scaled histogram overlaid with a transparent kernel density curve
# Distribution of mean expression in the liver samples
ggplot(df, aes(x = Liver_Avg)) +
    geom_histogram(aes(y = after_stat(density)),
                   binwidth = .5,
                   colour = "black", fill = "white") +
    geom_density(alpha = .2, fill = "#FF6666")

# Distribution of mean expression in the adipose samples
ggplot(df, aes(x = Adipose_Avg)) +
    geom_histogram(aes(y = after_stat(density)),
                   binwidth = .5,
                   colour = "black", fill = "white") +
    geom_density(alpha = .2, fill = "#FF6666")

# Distribution of mean expression in the bone marrow samples
ggplot(df, aes(x = Marrow_Avg)) +
    geom_histogram(aes(y = after_stat(density)),
                   binwidth = .5,
                   colour = "black", fill = "white") +
    geom_density(alpha = .2, fill = "#FF6666")
```

***

The histograms show that the gene expression data are approximately normally distributed in all three sample types. This is a good sign, in the sense that most genes follow a common expression pattern (going up or down together). An approximately normal shape is expected after the log2 transformation. Even in cancer, where expression is abnormal, almost 50% of genes follow a normal distribution. Still, this serves as a sanity check before proceeding to the next step.
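
The visual impression from the histograms could be backed up with a Q-Q plot and a formal test. The sketch below is a hedged illustration on simulated values: `toy_log_expr` is a hypothetical stand-in for one of the averaged log2 expression columns, and the subsample size of 500 is an arbitrary choice. It uses only base R (`shapiro.test()`, `qqnorm()`, `qqline()`).

```r
set.seed(1)
# Hypothetical stand-in for a column of log2 mean expression values
toy_log_expr <- rnorm(2000, mean = 5, sd = 2)

# shapiro.test() accepts between 3 and 5000 values, so test a random subsample
subsample <- sample(toy_log_expr, 500)
sw <- shapiro.test(subsample)

# Q-Q plot: points lying close to the reference line suggest approximate normality
qqnorm(toy_log_expr)
qqline(toy_log_expr, col = "darkred")
```

With many thousands of genes, a formal test tends to reject even small departures from normality, so the Q-Q plot is usually the more informative of the two.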

### Correlation Among Three Different Tissue Types in Terms of Gene Expression.{data-commentary-width=500}
```{r,message=F, warning=F, echo=FALSE}
# Correlation between liver and adipose tissue
ggplot(df, aes(x=Liver_Avg, y=Adipose_Avg)) + 
  geom_point(shape=18, color="blue")+
  geom_smooth(method=lm,  linetype="dashed",
              color="darkred", fill="blue")

# Correlation between liver and bone marrow tissue
ggplot(df, aes(x=Liver_Avg, y=Marrow_Avg)) + 
  geom_point(shape=18, color="blue")+
  geom_smooth(method=lm,  linetype="dashed",
              color="darkred", fill="blue")

# Correlation between adipose tissue and bone marrow
ggplot(df, aes(x=Adipose_Avg, y=Marrow_Avg)) + 
  geom_point(shape=18, color="blue")+
  geom_smooth(method=lm,  linetype="dashed",
              color="darkred", fill="blue")

```
***

The correlation plots show that gene expression is mostly similar, rather than different, across the three tissue types. In every plot, most genes (around 75%) fall on or near the trend line. On the other hand, several genes lie away from the trend line; these are candidates for differential expression.
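
One way to put numbers on these plots is to compute the correlation coefficient and flag genes whose residuals from the fitted trend line are unusually large. The following is a minimal sketch on simulated data. The `toy` data frame, the planted outlier, and the cutoff of 3 standardized residuals are all assumptions made for illustration, not part of the analysis above.

```r
set.seed(42)
n <- 200
# Hypothetical averaged log2 expression for two tissues
toy <- data.frame(Liver_Avg = rnorm(n, mean = 5, sd = 2))
toy$Adipose_Avg <- toy$Liver_Avg + rnorm(n, sd = 0.3)
# Plant one gene far from the trend so there is something to detect
toy$Adipose_Avg[1] <- toy$Liver_Avg[1] + 6

# Pearson correlation between the two tissues
r <- cor(toy$Liver_Avg, toy$Adipose_Avg)

# Fit the same linear trend that geom_smooth(method = lm) draws
fit <- lm(Adipose_Avg ~ Liver_Avg, data = toy)

# Genes with large standardized residuals lie far from the trend line
std_res <- rstandard(fit)
off_trend <- which(abs(std_res) > 3)
```

Genes listed in `off_trend` are only candidates; the per-gene tests across replicates are what assess significance.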


Statistical Analysis{.storyboard}
====================================================================

## Calculating Mean for Each Tissue Type (Liver, Bone Marrow, Adipose) Across Replicates

```{r,message=F, warning=F, echo=FALSE}
# Reload the CSV; the means and t-tests below therefore use the untransformed values
df <- read.csv("DE_Liver_BM_AD_MSC_3.csv")
# Add one mean-expression column per tissue
df$Liver_Avg <- rowMeans(df[, 7:9])
df$Adipose_Avg <- rowMeans(df[, 10:12])
df$Marrow_Avg <- rowMeans(df[, 13:15])

# Convert the data frame to a numeric matrix for row-wise apply()
dfm <- data.matrix(df)

# Per-gene Welch two-sample t-test for each pair of tissues
pvals_liver_adipose <- apply(dfm, 1, function(x) t.test(x[7:9], x[10:12])$p.value)
pvals_liver_marrow <- apply(dfm, 1, function(x) t.test(x[7:9], x[13:15])$p.value)
pvals_adipose_marrow <- apply(dfm, 1, function(x) t.test(x[10:12], x[13:15])$p.value)

# Create a new data frame with the p-value columns appended
df2 <- cbind(data.frame(df), data.frame(pvals_liver_adipose), data.frame(pvals_liver_marrow), data.frame(pvals_adipose_marrow))

# Keep chromosome, gene name, tissue means, and p-values for the results table
df_result <- df2[, c(1, 3, 16:21)]
reactable(df_result)
```
***Result***

The table above yields several important findings. It is impractical to go through every gene that differs significantly between two tissue types, so only the top genes are listed here. Between liver and adipose tissue, the top three differentially expressed genes are LAMA5, CTC518P12.6, and C5orf38. Between liver and bone marrow, the top three genes are CCDC170, AC010894.5, and FGD4. Finally, between adipose and bone marrow, the top three genes are ACTN4P1, C1orf54, and POLR3B. For further information, you can explore the interactive table above.
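
Because one t-test is run per gene, a common follow-up is to rank genes by p-value and adjust for multiple testing. The sketch below shows the idea with the Benjamini-Hochberg method in base R's `p.adjust()`. The gene names and p-values are hypothetical, and this adjustment is an optional extension, not a step the analysis above performs.

```r
# Hypothetical p-values standing in for one of the pvals_* columns
toy <- data.frame(
  GeneName = paste0("gene", 1:10),
  pval = c(0.0004, 0.003, 0.02, 0.04, 0.049, 0.2, 0.35, 0.5, 0.7, 0.9)
)

# Benjamini-Hochberg adjustment controls the false discovery rate
toy$padj <- p.adjust(toy$pval, method = "BH")

# Top three genes by raw p-value
top3 <- head(toy[order(toy$pval), "GeneName"], 3)

# Compare how many genes pass the 0.05 cutoff before and after adjustment
n_raw <- sum(toy$pval < 0.05)
n_bh <- sum(toy$padj < 0.05)
```

In this toy example fewer genes pass the cutoff after adjustment than before, which is the usual direction of the effect.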



```{r}
# Count how many p-values in a column fall below the 0.05 significance threshold
CountMySignificant <- function(x) {
  sum(x < 0.05, na.rm = TRUE)
}
```
```{r,message=F, warning=F, echo=TRUE, include = FALSE}
CountMySignificant(df2$pvals_liver_adipose)
CountMySignificant(df2$pvals_liver_marrow)
CountMySignificant(df2$pvals_adipose_marrow)

```
***2402 out of 14471 (16.6%) genes are differentially expressed (p < 0.05) between liver and adipose tissue***

***2918 out of 14471 (20.2%) genes are differentially expressed (p < 0.05) between liver and bone marrow tissue***

***1604 out of 14471 (11.1%) genes are differentially expressed (p < 0.05) between adipose and bone marrow tissue***
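
The percentages follow directly from the counts reported above, as the short sketch below shows. It also computes, as a rough point of comparison, how many genes would be expected to fall below 0.05 if no gene truly differed. That baseline assumes p-values are uniformly distributed under the null hypothesis and is not a figure from the analysis itself.

```r
total_genes <- 14471

# Significant gene counts reported for each tissue comparison
significant <- c(liver_adipose = 2402, liver_marrow = 2918, adipose_marrow = 1604)

# Share of all genes, in percent, rounded to one decimal place
percent <- round(100 * significant / total_genes, 1)

# Rough number of genes expected below 0.05 by chance alone (assumed null baseline)
expected_by_chance <- 0.05 * total_genes
```

All three counts are well above this chance baseline of roughly 724 genes.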

Biography{.storyboard}
====================================================================
**Anamul Haque**

*Ph.D. Student, Biomedical Data Science, and Informatics Program,*

*Clemson University*

*ahaque@clemson.edu*

*540-525-732*


I am a second-year graduate student at the School of Computing, currently working with Professor William Richardson in the Clemson Bioengineering department. Before joining Clemson University, I earned my master's degree in Biology from Virginia Tech and my bachelor's degree in Microbiology from the University of Dhaka, Bangladesh. I have 8 years of teaching and research experience in academia and research institutes as a research and teaching assistant. I enrolled in this class because, although I have taken several statistics classes, I had never taken one taught in R. This class helped me understand R more deeply than any other class. In the future, I am interested in working in Biomedical Informatics, especially in Text Mining and Natural Language Processing. Driving is my passion, and I drive almost 30,000 miles a year. In my leisure time, I also love to tinker with computer hardware. I currently live in Greenwood, South Carolina, with my wife, who is an employee of the Greenwood Genetic Center.


### Photo of Anamul Haque


```{r, out.width= "200px", echo=FALSE, fig.align='center',fig.cap = "Anamul Haque"}
knitr::include_graphics("photo_anamulhaque.jpg")
```